ABSTRACT

The following problem is considered: given a linked list of length n, compute
the distance of each element of the linked list from the end of the list. The
problem has two standard deterministic algorithms: a linear time serial algorithm,
and an O((n log n)/p + log n) time parallel algorithm using p processors. A known
conjecture states that it is impossible to design an O(log n) time deterministic
parallel algorithm that uses only n/log n processors.

We present three randomized parallel algorithms for the problem. One of
these algorithms runs almost-surely in time of O(n/p + log n log* n) using p
processors on an exclusive-read exclusive-write parallel RAM.


*This   research  was  supported  by  DOE  grant  DE-AC02-76ER03077  and  by  NSF  grant 
NSF-MCS79-21258. 

**To  be  presented  at  the  16th  STOC. 


I. Introduction


The family of models of computation used in this paper is the parallel
random-access machines (PRAMs). All members of this family employ p synchronous
processors all having access to a common memory. The PRAM family has 3 notable
members. In a concurrent-read concurrent-write (CRCW) PRAM simultaneous reading
from the same memory location is allowed as well as simultaneous writing. In the
latter case the lowest numbered processor succeeds. A concurrent-read
exclusive-write (CREW) PRAM allows simultaneous reading from the same memory
location but not simultaneous writing. An exclusive-read exclusive-write (EREW)
PRAM does not allow simultaneous reading or writing. See [Vi-83a] for a recent
survey of results concerning the PRAM family.

Let Seq(n) be the fastest known worst-case running time of a sequential
algorithm, where n is the length of the input for the problem being considered.
Obviously, the best upper bound on the parallel time achievable using p processors
without improving the sequential result is of the form O(Seq(n)/p). A parallel
algorithm that achieves this running time is said to have optimal speed-up or, more
simply, to be optimal. An ideal goal for serial computation is to design linear
time algorithms (O(n) time). An analogous ideal goal for parallel computation is to
design algorithms whose running time is proportional to n/p, where p is the
number of processors used. In this case we say that a parallel algorithm achieves
parallel linear running time.

The following problem is considered. (See also Fig. 1.)
Input. A linked list of length n. It is given in an array of length n, not
necessarily in the order of the linked list. Each of the n elements (except the
last element in the linked list) has a pointer to its subsequent element in the
linked list.
The list-ranking problem. Compute for each element its distance, counting
elements, from the end of the list.

The problem has a trivial linear time serial algorithm. However, Wyllie
[W-79] conjectured that Ω(n) processors are required in order to get O(log n)
time. If true, this implies, in particular, that there is no optimal speed-up
parallel algorithm for n/log n processors. It is further conjectured that optimal
speed-up is impossible even for fewer than n/log n processors (we leave open
how many fewer).

The goal of this paper is to narrow the gap between the apparently most
efficient deterministic parallel algorithm and the ideal goal of optimal speed-up
by means of randomized parallel algorithms. Our algorithms obtain the running times
mentioned below with probability that converges rapidly to one as n grows. Our
strongest results are:

(1) A parallel algorithm that runs in O(n/p) time using p ≤ n/(log n log* n)
processors on an EREW PRAM. (Observe that this algorithm achieves optimal
speed-up.) Recall that log* n grows extremely slowly and can be viewed as a constant
for all practical purposes. (For instance, log* 2^65536 = 5. See the function G
in [AHU-74], p. 133.) In particular, it runs in "about" O(log n) time using
"about" n/log n processors.

(2) An O(log n) time algorithm using (n loglog n)/log n processors on a CRCW PRAM.
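To make the remark about the slow growth of the iterated logarithm concrete, here is a minimal Python sketch. The function name log_star is our own, and base-2 logarithms are assumed; it simply counts how many times log2 must be applied before the value drops to 1 or below.

```python
import math

def log_star(n):
    """Iterated base-2 logarithm: number of times log2 is applied to n
    until the value is at most 1. (Helper name is illustrative only.)"""
    count = 0
    x = n
    while x > 1:
        x = math.log2(x)  # math.log2 also accepts very large Python ints
        count += 1
    return count

if __name__ == "__main__":
    for value in (2, 4, 16, 65536, 2 ** 65536):
        print(log_star(value))
```

Under these assumptions the values 2, 4, 16, 65536 and 2^65536 have iterated logarithm 1, 2, 3, 4 and 5 respectively, so any input that fits in a physical memory has log* n at most 5.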

The list of optimal speed-up parallel algorithms obtained so far is fairly
short in spite of the interest in them. Let us mention the few known parallel
linear algorithms: computation of partial "sums" of n variables, where the word
"sum" stands for any associative binary operation (this obvious algorithm is
stated in the next section), [SV-81] for finding the maximum among n elements and
merging, [Vi-83b] for finding the k smallest out of n elements, [G-84] for string
matching, [CLC-81] and [Vi-81] for computing connected components of dense graphs,
[TC-82] and [TV-83] for computing biconnected components of dense graphs and
[BV-83] for generation of a computation tree form of an arithmetic expression and
for finding matches in a sequence of parentheses. In addition, there are optimal
speed-up algorithms for two more problems: [AKS-83] and [SV-81] for sorting and
[PVW-83] for various operations on 2-3 trees.

Finding examples where randomized algorithms apparently beat the performance
of their best possible deterministic counterparts is considered an interesting
question in computational complexity. There are only a few examples where
randomization is proven to be complexity effective. See, for instance, [Ra-83]
and [MV-83]. See [Ra-76] for more on this concept of randomization.

Randomization in parallel computation. [Re-81] gave a randomized algorithm
for selection of the k-th smallest element out of a set of n elements in a decision
tree model of parallel computation. It runs in O(1) time using n processors with
probability that converges rapidly to one. [Me-82] showed that it is possible to
use ideas similar to [Re-81] and [SV-81] in order to get, in the CRCW PRAM, an O(1)
time algorithm for finding the maximum among n elements using n processors with a
similar probability. This is particularly interesting since [Va-75] proved that n
processors need Ω(loglog n) time in a parallel comparison model of computation in
order to find the maximum among n elements. The selection algorithm of [Re-81]
can be implemented (in a straightforward manner) to run on the EREW PRAM in O(n/p)
time using p ≤ n/log n processors with a similar probability. We do not know if a
deterministic algorithm can match this result. (The deterministic selection
algorithm of [Vi-83b] runs in O(n/p) time using p ≤ n/(log n loglog n)
processors.) See also [RV-83] for a randomized sorting algorithm that involves
parallel processors.

In the paper we actually present three algorithms. While the deterministic
parts of all these algorithms are similar, they differ considerably in the way in
which randomization is applied. This facilitates a comparative demonstration of
the role of randomization in this instance of parallel computation; i.e., we can
focus on the contribution of the specific way in which randomization is applied in
each of these algorithms. The simplicity of the definition of our problem helps
also in this direction. We believe that randomization will become an important
tool in the design of efficient parallel algorithms.

The list ranking problem is encountered often in the design of parallel
algorithms. For instance, some of the tree procedures in the biconnectivity
algorithm of [TV-83] may have now the same efficiency as the randomized algorithms
presented here.

Remark. [CSV-82] classified many problems with respect to how fast they can
be solved by CRCW PRAM algorithms using a polynomial number of processors. We
show a reduction from sorting into the list ranking problem that preserves time up
to a constant factor and number of processors up to a polynomial. Say that we
wish to sort an array of n numbers A(1), A(2), ..., A(n). For each A(i) compute in
O(1) time the largest number which is smaller than A(i) using the constant time
algorithm of [SV-81] for finding the maximum among n elements. This results in a
linked list that has to be "ranked" as required in our problem. We do not know if
a reduction in the reverse direction exists. Moreover, [CSV-82] give an
O(log n/loglog n) time algorithm for sorting using a polynomial number of
processors, but we do not know whether such an upper bound can be obtained for the
list ranking problem.
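The reduction in the Remark can be illustrated with a small sequential Python sketch. It assumes the numbers are distinct, replaces the constant-time parallel maximum of [SV-81] with a plain scan, and ranks the list by simply walking it; the function name is ours. The point is only that the rank of A(i) in the resulting list equals its position in sorted order.

```python
def sort_by_list_ranking(a):
    """Sort distinct numbers by linking each one to the largest smaller
    number and then ranking the resulting list (sequential illustration)."""
    n = len(a)
    END = None
    # D[i] = index of the largest number smaller than a[i]; the minimum
    # has no such number and therefore becomes the end of the list.
    D = []
    for i in range(n):
        smaller = [k for k in range(n) if a[k] < a[i]]
        D.append(max(smaller, key=lambda k: a[k]) if smaller else END)
    # Rank = distance, counting elements, from element i to the end.
    out = [None] * n
    for i in range(n):
        j, rank = i, 0
        while j is not END:
            rank += 1
            j = D[j]
        out[rank - 1] = a[i]  # rank r means "r-th smallest"
    return out

if __name__ == "__main__":
    print(sort_by_list_ranking([5, 3, 9, 1]))
```

Any list ranking procedure could replace the inner walk; the sketch uses the trivial serial one to keep the example short.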

Among other things, the next section recollects an optimal speed-up
deterministic parallel algorithm that uses balanced trees. It is used later for
two purposes: (1) as a subroutine, and (2) for explanation of the randomized
parallel algorithms. A careful look shows that all our algorithms essentially
manipulate randomization into the (sometimes remote) framework of this algorithm.
A randomized parallel algorithm whose running time is O(log n loglog n) (with
probability that converges rapidly to one) using n/(log n loglog n) processors is
presented in Sec. 3. The algorithm given in Sec. 4 improves this result. Its
running time is O(log n log* n) (with probability that converges rapidly to one)
using n/(log n log* n) processors. An O(log n) time algorithm (with probability
that converges to one) using (n loglog n)/log n processors is presented in Sec. 5.

II.  Preliminaries 

Theorem (Brent). Any synchronous parallel algorithm of time t that consists
of a total of x elementary operations can be implemented by p processors within a
time of ⌈x/p⌉ + t.

Proof of Brent's theorem. Let x_i denote the number of operations performed
by the algorithm in time i (so that x_1 + ... + x_t = x). We now use the p
processors to "simulate" the algorithm. Since all the operations in time i can be
executed simultaneously, they can be computed by the p processors in ⌈x_i/p⌉ units
of time. Thus, the whole algorithm can be implemented by p processors in time of

\[ \sum_{i=1}^{t} \lceil x_i/p \rceil \;\le\; \sum_{i=1}^{t} (x_i/p + 1) \;\le\; \lceil x/p \rceil + t . \]


Remark. The proof of Brent's theorem poses two implementation problems. The
first is to evaluate x_i at the beginning of time i in the algorithm. The second
is to assign the processors to their jobs.
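The counting argument in the proof can be checked numerically. The following Python sketch (function names are ours) takes the number of operations in each time step of some parallel algorithm, computes the time of the step-by-step simulation by p processors, and compares it with the bound of the theorem. It deliberately ignores the two implementation problems mentioned in the Remark.

```python
def simulated_time(ops_per_step, p):
    """Time for p processors to simulate the algorithm step by step:
    step i costs ceil(x_i / p) units."""
    return sum(-(-x // p) for x in ops_per_step)  # -(-x // p) is ceil(x / p)

def brent_bound(ops_per_step, p):
    """The bound ceil(x / p) + t, where x is the total number of
    operations and t is the number of time steps."""
    x = sum(ops_per_step)
    t = len(ops_per_step)
    return -(-x // p) + t

if __name__ == "__main__":
    # A balanced binary summation tree on 16 leaves performs 8, 4, 2, 1
    # additions in its four levels.
    steps = [8, 4, 2, 1]
    print(simulated_time(steps, 4), brent_bound(steps, 4))
```

For the summation tree on 16 leaves and p = 4, the simulation takes 5 time units, within the bound of 8.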

Recall the following standard deterministic parallel algorithm for the
list-ranking problem (defined in the Introduction). Say that we have n
processors. Assign a processor to each of the n elements. Denote the pointer of
element i of the input array by D(i) and initialize R(i) := 1, 1 ≤ i ≤ n. We
set D(t) := "end of list" (where t is the last element in the linked list), D("end
of list") := "end of list" and R("end of list") := 0.

Apply ⌈log n⌉ iterations:

for processor i, 1 ≤ i ≤ n, pardo (perform in parallel)

R(i) := R(i) + R(D(i)); D(i) := D(D(i)) (to be called the short-cut operation).

(See Fig. 2.)

Note that a total of Θ(n log n) short-cuts is required in this algorithm. It
runs in time of O((n log n)/p + log n) using p processors on an EREW PRAM and
leaves the solution of the list ranking problem in the vector R.
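A sequential Python simulation of the short-cut iterations may help in following the algorithm. This is only a sketch: the input encoding (succ[i] is the index of the successor of element i, or None for the last element) and the function name are our assumptions, and the synchronous parallel step is imitated by computing all new values from the old ones before writing them back.

```python
def pointer_jumping_rank(succ):
    """Rank a linked list by repeated short-cuts.
    succ[i] is the successor index of element i, or None for the last
    element. Returns R with R[i] = distance of i from the end of the
    list, counting elements."""
    n = len(succ)
    END = n  # extra slot playing the role of "end of list"
    D = [END if s is None else s for s in succ] + [END]
    R = [1] * n + [0]
    rounds = (n - 1).bit_length() if n > 0 else 0  # ceil(log2 n)
    for _ in range(rounds):
        # All "processors" read the old values, then all write.
        new_R = [R[i] + R[D[i]] for i in range(n)]
        new_D = [D[D[i]] for i in range(n)]
        R[:n] = new_R
        D[:n] = new_D
    return R[:n]

if __name__ == "__main__":
    # The list 2 -> 0 -> 3 -> 1 stored in an array of length 4.
    print(pointer_jumping_rank([3, None, 0, 1]))
```

After k iterations R(i) equals the smaller of 2^k and the true rank, which is why ⌈log n⌉ iterations suffice.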

Implementation Remark 1. In order to derive this running time from Brent's
theorem, n has to be broadcast to all p processors. This takes additional
O(log p) time.

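Proposition 1. (This is a variant, due to [AV-79], p. 158, of Chernoff's
bounds.) For all ℓ, q, β with 0 ≤ q ≤ 1, 0 ≤ β ≤ 1:

\[ \text{(a)}\quad \sum_{k=0}^{\lfloor (1-\beta)\ell q \rfloor} \binom{\ell}{k} q^{k} (1-q)^{\ell-k} \;\le\; \exp(-\beta^{2}\ell q/2) \]

\[ \text{(b)}\quad \sum_{k=\lceil (1+\beta)\ell q \rceil}^{\ell} \binom{\ell}{k} q^{k} (1-q)^{\ell-k} \;\le\; \exp(-\beta^{2}\ell q/3) \]
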

Let X be a random variable having the hypergeometric distribution with
parameters ℓ, Nq, N (0 ≤ q ≤ 1). One way to demonstrate this distribution is: we
have a bag containing N balls, Nq of them white and the rest black. X denotes
how many white balls we get while sampling ℓ balls without replacement. Let
L_{N,ℓ,C}(q) be Prob(X ≤ C) for any C.

Let Y be a random variable having the binomial distribution with parameters ℓ, q.
In the above example Y denotes how many white balls we get while sampling ℓ balls
with replacement. Let L_{ℓ,C}(q) be Prob(Y ≤ C). See [F-50].
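As a sanity check on how Proposition 1 is used for the binomial variable Y, the following Python sketch compares an exactly computed lower tail with the Chernoff-type bound exp(-β²ℓq/2). That form of the lower-tail bound is the standard one from [AV-79] and is assumed here; the function names are ours. The parameters q = 1/4 and β = 1/2 are the ones used later in the analysis of the first algorithm.

```python
from math import comb, exp, floor

def binomial_lower_tail(l, q, c):
    """Exact Prob(Y <= c) for Y binomial with parameters l, q."""
    return sum(comb(l, k) * q ** k * (1 - q) ** (l - k) for k in range(c + 1))

def chernoff_lower_bound(l, q, beta):
    """Assumed form of the lower-tail bound: exp(-beta^2 * l * q / 2)."""
    return exp(-beta * beta * l * q / 2)

if __name__ == "__main__":
    q, beta = 0.25, 0.5
    for l in (16, 64, 256):
        c = floor((1 - beta) * l * q)
        print(l, binomial_lower_tail(l, q, c), chernoff_lower_bound(l, q, beta))
```

With ℓ = n/2 the bound becomes e^{-n/64}, which is the quantity that appears in the proof of the Lemma in Section III.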

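Proposition 2. (Due to Uhlmann, cf. [JK-69], p. 151.)

\[ \text{(a)}\quad L_{N,\ell,C}(q) - L_{\ell,C}(q) \ge 0 \quad \text{for } 0 \le q \le C(\ell-1)^{-1} N (N+1)^{-1} \]

\[ \text{(b)}\quad L_{N,\ell,C}(q) - L_{\ell,C}(q) \le 0 \quad \text{for } 1 \ge q \ge (C(\ell-1)^{-1} N + 1)(N+1)^{-1} \]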


Balanced  binary  tree  parallel  algorithms. 

One  simple  pattern  of  optimal  speed-up  deterministic  parallel  algorithms  is 
the  balanced  binary  tree.  This  pattern  was  used,  among  many  others,  by  [W-79], 
[CLC-81]  and  [Vi-81].  Let  us  first  demonstrate  this  pattern  on  the  problems  of 
computing  sums  and  partial  sums. 

Input. An array of n numbers A(1), A(2), ..., A(n). Assume, w.l.g., that log_2 n is
an integer.

Problem. Compute their sum.

Algorithm. "Plant" a balanced binary tree with n leaves. Every node of the tree
is denoted [h,j]. See Fig. 3. Leaf [0,j] corresponds to A(j). Associate a
number B[h,j] with every node of the tree.
Initialization. for all 1 ≤ j ≤ n pardo B[0,j] := A(j).
for h := 1 to log n
    for all 1 ≤ j ≤ 2^(log n - h) pardo B[h,j] := B[h-1,2j-1] + B[h-1,2j].

B[log n,1] holds the desired sum.

Think  first  about  an  n  processor  implementation  of  this  summation  algorithm. 
It  runs  in  O(log  n)  time.  Then  apply  the  proof  of  Brent's  Theorem  to  get  an 
alternative  implementation  that  uses  only  n/log  n  processors  and  runs  in  O(log  n) 
time.  This  summation  algorithm  can  be  extended  to  solve  the  following  partial-sum 
problem. 

Input. Same as for the summation problem.
Problem. Compute A(1) + A(2) + ... + A(i) for all 1 ≤ i ≤ n.
Algorithm. Perform the summation algorithm given above. An additional
"down-sweep" of the tree (from the root to the leaves), which roughly amounts to
reversing the operation of the summation algorithm, will complete the job:
Associate another number C[h,j] with each node [h,j].
Initialization. C[log n,1] := 0.
for h := log n - 1 downto 0
    for all 1 ≤ j ≤ 2^(log n - h) pardo if j is odd then C[h,j] := C[h+1,(j+1)/2]
    else C[h,j] := C[h+1,j/2] + B[h,j-1].
for all 1 ≤ j ≤ n pardo C[0,j] := C[0,j] + B[0,j].

C[0,j], 1 ≤ j ≤ n, hold the desired partial sums. This algorithm can also be
implemented to run in O(n/p + log n) time using p processors on an EREW PRAM.
(Apply Brent's theorem and Implementation Remark 1.)
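The two sweeps over the tree can be traced with a direct Python transcription. This is a sequential sketch in which each pardo loop becomes an ordinary loop; the arrays B and C follow the text, while the function name and the 1-based row layout (index 0 of each row unused) are our choices. It assumes that n is a power of two, as in the text.

```python
def tree_partial_sums(a):
    """Partial sums by an up-sweep (array B) and a down-sweep (array C)
    over a balanced binary tree; len(a) must be a power of two."""
    n = len(a)
    levels = n.bit_length() - 1  # log2 n
    assert n >= 1 and (1 << levels) == n, "n must be a power of two"
    # Row h holds nodes [h,1] .. [h, n / 2^h]; index 0 is unused.
    B = [[0] * ((n >> h) + 1) for h in range(levels + 1)]
    C = [[0] * ((n >> h) + 1) for h in range(levels + 1)]
    for j in range(1, n + 1):
        B[0][j] = a[j - 1]
    for h in range(1, levels + 1):  # summation (up-sweep)
        for j in range(1, (n >> h) + 1):
            B[h][j] = B[h - 1][2 * j - 1] + B[h - 1][2 * j]
    C[levels][1] = 0
    for h in range(levels - 1, -1, -1):  # down-sweep
        for j in range(1, (n >> h) + 1):
            if j % 2 == 1:
                C[h][j] = C[h + 1][(j + 1) // 2]
            else:
                C[h][j] = C[h + 1][j // 2] + B[h][j - 1]
    return [C[0][j] + B[0][j] for j in range(1, n + 1)]

if __name__ == "__main__":
    print(tree_partial_sums([3, 1, 4, 1, 5, 9, 2, 6]))
```

During the down-sweep C[h,j] holds the sum of all leaves to the left of the subtree rooted at [h,j], which is why a left child inherits its parent's value and a right child adds its left sibling's B value.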

A  wishful  thinking.  We  want  to  find  an  algorithm  for  the  list  ranking 
problem  whose  total  number  of  short-cuts  is  0(n).  If  we  could  "plant"  a  balanced 
binary  tree  in  our  linked  list  (in  the  order  of  the  linked  list)  it  would  have 
solved  our  problem:  enter  one  at  each  leaf  and  apply  the  partial  sum  algorithm.  A 
closer  look  at  the  summation  part  of  such  a  partial  sum  computation  reveals  the 
following: 

The  operation  of  the  for  statement  for  h=l  corresponds  to  short-cuts  at  all  even 
(relative  to  the  linked  list)  locations.  This  results  in  a  new  linked  list  that 
connects  only  even  locations  of  the  original  list,  thereby,  halving  its  length. 
Then,  the  for  statement  for  h=2  corresponds  to  short-cuts  at  even  locations  of  the 
new  linked  list  and  so  on.  See  Fig.  4.  To  sum  up:  The  for  statement  of  the 
summation  algorithm  never  performs  a  short-cut  at  two  successive  elements  of  the 
linked  list  at  hand;  and,  therefore,  the  "input"  to  any  operation  of  this  for 
statement  is  a  single  linked  list. 

(Remark. The problem is, of course, that we do not know how to plant a balanced
binary tree with respect to the linked list without first solving the list
ranking problem itself, since this "planting" needs the ranking mod 2, mod
4, mod 8, ..., as explained above.)

In our randomized algorithms we plant "randomly balanced trees". That is,
the short-cut operations are picked at random such that we never perform
simultaneous short-cuts at two successive locations of the linked list. Thereby,
we move iteratively from one linked list to another single shorter linked list.

III.  The  First  Algorithm 

The first algorithm forms the closest (among our algorithms) randomized
analogy to the partial-sum algorithm. In particular, its first part is a
randomized analogy to the summation algorithm. Its second part (Step 5 below) is
similar to the extension of the summation algorithm to the partial-sum algorithm.

The algorithm uses p ≤ n/(log n loglog n) processors on an EREW PRAM.
Initialization. m := n. Each processor is assigned to a successive segment of
length n/p in the input array. Similar to the deterministic algorithm, denote the
pointer of element i by D(i) and initialize R(i) := 1, 1 ≤ i ≤ n. We set
D(t) := "end of list" (where t is the last element in the list), D("end of
list") := "end of list" and R("end of list") := 0.

while m > n/log n do

(Comment. The input to each iteration of this while loop is a linked list of
length m stored in an array of length m. The vector D contains for each element
its subsequent element in the linked list.)

Processor i, 1 ≤ i ≤ p, is assigned to segment [(i-1)m/p + 1, ..., im/p] in the
array which forms the input to this while loop. (Assume, w.l.g., that m/p is an
integer.)

for Processor i, 1 ≤ i ≤ p, pardo
Step 1. for j := (i-1)m/p + 1 to im/p do
    Toss a coin; assign the result of the coin to c(j); SURVIVE(j) := 1.
(Comments. Each coin gets 0 or 1 with probability one half. Assume that c("end of
list") is always 0. SURVIVE(j) is initialized to 1. It implies that element j is
included in the output linked list of this iteration of the while loop unless
SURVIVE(j) is set to 0 in Step 2.)
Step 2. for j := (i-1)m/p + 1 to im/p do
    for each element j such that c(j)=0 and c(D(j))=1 do
        OP(i,t) := (D(j),j,R(j)); SURVIVE(D(j)) := 0;
        R(j) := R(j) + R(D(j)); D(j) := D(D(j)) (shortcut).

(Comments. 1. The shortcut operation cannot be applied to two successive elements
in the linked list. 2. Each element whose predecessor in the list performed a
shortcut remains with no incoming pointers. It is "deleted" in Step 3. The
instruction SURVIVE(D(j)) := 0 takes care of this. 3. The parameter t stands for
the present time. The information in OP(i,t) enables us to reconstruct later the
operation of processor i at time t. This is used in Step 5 to derive the final
value of R(D(j)) from the final value of R(j).)

Step 3. Perform the balanced binary tree partial-sum computation described in the
previous section with respect to the vector SURVIVE. As a result:

(1) m := the sum of SURVIVE(j) over all j, and

(2) each element j with SURVIVE(j)=1 gets its entry number in a (contracted) array
of length m containing the output linked list.

(This array is the input for the next (if any) iteration of the while loop.)

od

Let T_1 (resp. T_2) be the first (resp. last) time unit for which an assignment
into OP( , ) was performed.

Step 4. Apply a simulation of the deterministic algorithm by p processors to the
current array.

Step 5.

for Processor i, 1 ≤ i ≤ p, pardo

for t := T_2 downto T_1 do

R(OP(i,t).1) := R(OP(i,t).2) - OP(i,t).3 .

(Comment. OP(i,t).k, k=1,2,3, represents the fields of OP(i,t).)

Implementation  remark.   Each   time   m   gets   a   new   value   broadcast   it   to   all 

processors  as  in  Implementation  Remark  1  of  the  previous  section. 
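The following Python sketch simulates the first algorithm sequentially, as an aid to reading Steps 1 to 5. It is not the EREW PRAM implementation: the processor segments, the contraction of Step 3 and the time stamps of OP are replaced by a list of surviving elements and a single stack of records, and the input encoding (succ[i] is the successor index or None) and all names are our assumptions.

```python
import math
import random

def randomized_list_ranking(succ, seed=0):
    """Sequential simulation of the coin-tossing algorithm.
    Returns R with R[i] = distance of i from the end, counting elements."""
    n = len(succ)
    if n == 0:
        return []
    rng = random.Random(seed)
    END = n
    D = [END if s is None else s for s in succ] + [END]
    R = [1] * n + [0]
    alive = list(range(n))  # elements of the current linked list
    ops = []                # records (deleted element, its predecessor, old R)
    threshold = n / max(1.0, math.log2(n))
    while len(alive) > threshold:
        # Step 1: one coin per element; c("end of list") is always 0.
        coin = [0] * (n + 1)
        for j in alive:
            coin[j] = rng.randint(0, 1)
        survive = {j: True for j in alive}
        # Step 2: j short-cuts over D(j) when c(j) = 0 and c(D(j)) = 1.
        # No two successive elements qualify, so a plain loop behaves
        # like the synchronous parallel step.
        for j in alive:
            d = D[j]
            if coin[j] == 0 and coin[d] == 1:
                ops.append((d, j, R[j]))
                survive[d] = False
                R[j] += R[d]
                D[j] = D[d]
        # Step 3 (contraction) reduces here to filtering the survivors.
        alive = [j for j in alive if survive[j]]
    # Step 4: deterministic short-cut iterations on the remaining list.
    for _ in range((len(alive) - 1).bit_length()):
        new_R = {j: R[j] + R[D[j]] for j in alive}
        new_D = {j: D[D[j]] for j in alive}
        for j in alive:
            R[j], D[j] = new_R[j], new_D[j]
    # Step 5: undo the recorded operations in reverse order.
    for deleted, kept, old_r in reversed(ops):
        R[deleted] = R[kept] - old_r
    return R[:n]

if __name__ == "__main__":
    print(randomized_list_ranking([3, None, 0, 1]))
```

The result does not depend on the coin tosses; only the number of iterations of the while loop does, which is what the complexity analysis below bounds.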

Complexity.


Theorem. The algorithm runs in time O(n/p), with probability
1 - O(e^{-n/(64 log n)} loglog n), using p ≤ n/(log n loglog n) processors.

Proof. Each iteration of the while loop takes a total of O(m/p + log m)
time. Step 4 takes O(n/p) time. The time for Step 5 is bounded by the time for
the while loop.

Denote the length of the input array to iteration i of the while loop by m_i,
for i = 1,2,... . The quality of our algorithm is determined by the actual
values of the sequence m_1, m_2, ...

Lemma. There exists a positive integer n_0 such that for all n ≥ n_0, the
following is satisfied: the probability that m_i ≤ (15/16)^{i-1} n, for all
i = 2,3,..., is 1 - O(e^{-n/(64 log n)} loglog n).

In this case: (1) the number of iterations of the while loop is O(loglog n); and
(2) the time spent on the while loop is

\[ O\Big( \sum_{i=0}^{O(\log\log n)} \big( (15/16)^{i} n/p + \log n \big) \Big) = O(n/p + \log n \log\log n) \]

and the Theorem follows.

It remains to prove the Lemma.
Observation 1. Element j performs a shortcut in Step 2 with probability
1/4 = Prob(c(j)=0) · Prob(c(D(j))=1). (Unless D(j) is "end of list".)
Observation 2. Let i and j be two elements in even locations of the linked list.
Then the events that they perform shortcuts in Step 2 are independent.

We show that with high probability sufficiently many shortcuts in even
locations are performed. For this apply Proposition 1(a) of the previous section
with ℓ = n/2, q = 1/4 and β = 1/2. The probability that, following one iteration,
fewer than n/16 of the n/2 possible short-cuts in even locations are performed is
≤ e^{-n/64}. This bounds also the probability that m_2, the length of the list
following one iteration, is > 15n/16. Let i > 1 be an integer. Suppose that for
all 2 ≤ j ≤ i, m_j ≤ (15/16)^{j-1} n. By similar considerations the probability
that m_{i+1} ≤ (15/16)^{i} n is ≥ 1 - e^{-m_i/64}.

We already implied that the number of iterations, in case the sequence of the
m_i-s satisfies the bounds of the Lemma, is O(loglog n). Since m_i > n/log n in
every iteration that is performed, the probability that this will happen is
≥ (1 - e^{-n/(64 log n)})^{O(loglog n)} ≥ 1 - O(e^{-n/(64 log n)} loglog n)
for all n ≥ n_0, for a suitable n_0. This converges very rapidly to 1 as n grows.

IV. The Second Algorithm

Throughout this section we use p = n/(log n log* n) processors on an EREW PRAM.
In the algorithm below processors are assigned to elements through random
permutations. We use the fact that with very high probability only few of the
processors are assigned to subsequent elements.

Initialization. m := n. Vectors R and D are defined and initialized as in the
first algorithm. Throughout this section we use also D^{-1}, the inverse of D (to
form a doubly linked list). D^{-1} is initialized in O(n/p) time in a
straightforward manner.

while m > c_1 n/log n do

(Comment. c_1 (> 1) is some proper constant. We elaborate on how to select c_1 in
our complexity analysis.)

Step 1. Take a random (precomputed) permutation a of 1,2,...,m. (Assume, w.l.g.,
that m/p is an integer.) Assign processor i, 1 ≤ i ≤ p, to the segment of m/p
elements [(i-1)m/p + 1, ..., im/p] in the domain of a.

Step 2. Processor i, 1 ≤ i ≤ p, scans its segment from left to right in m/p
pulses. At pulse t, 1 ≤ t ≤ m/p, processor i is at the a((i-1)m/p + t) element of
the input array. Denote this element by a_{i,t} and its predecessor (if it exists)
D^{-1}(a_{i,t}) by b_{i,t}. Processor i marks a_{i,t} as "accessed at pulse t".

Processor i at pulse t.

if D(a_{i,t}) is marked as accessed at pulse t or a_{i,t} is the tail of the list

then SURVIVE(a_{i,t}) := 1

else OP(i,t) := (a_{i,t}, b_{i,t}, R(b_{i,t})); R(b_{i,t}) := R(b_{i,t}) + R(a_{i,t});

SURVIVE(a_{i,t}) := 0; D(b_{i,t}) := D(a_{i,t}) (shortcut); D^{-1}(D(b_{i,t})) := b_{i,t}.

(Comment. The only case where D^{-1}(a_{i,t}) does not exist is when a_{i,t} is the
tail of the list.)

(Explanation. If currently there is no other processor at the element ahead of
a_{i,t}, a shortcut is performed. Note that, unlike the previous algorithms, the
shortcut detours a_{i,t} itself rather than its successor. This change is in order
to avoid the possibility that a_{i,t} was detoured in previous pulses.)

Step 3. A contraction of the remaining linked list into an array is performed as
in Step 3 of the first algorithm. Assign their number into m.
(Comment. Later we refer to the elements not in the remaining linked list as
deleted.)
od

Let T_1 (resp. T_2) be the first (resp. last) time unit for which an assignment
into OP( , ) was performed.

Step 4. Apply a simulation of the deterministic algorithm by p processors to the
≤ c_1 n/log n remaining elements.
Step 5.

for Processor i, 1 ≤ i ≤ p, pardo
for t := T_2 downto T_1 do
R(OP(i,t).1) := R(OP(i,t).2) - OP(i,t).3 .

Implementation remarks. 1. Each time m gets a new value, broadcast it to all
processors as in Implementation Remark 1 of Section 2. 2. The actual series of
values of m is itself a random variable and is not known in advance. Question.
Do we have to store precomputed random permutations on [1,2,...,m] for every
possible m? Answer. Store random permutations only for powers of 2 (for
"sufficiently large" numbers). Given an m, take a random permutation for
[1,2,...,2^⌈log m⌉] and "contract" it into a (random) permutation on [1,2,...,m]
similar to the way in which the vector SURVIVE is used to contract the array
containing the linked list. This will not affect time complexity by more than a
constant factor.
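The contraction of a stored permutation can be sketched as follows in Python. The sketch mirrors the use of SURVIVE: a 0/1 vector marks the entries with value at most m, and prefix sums of that vector give each kept entry its position in the contracted array. The function name is ours, and the prefix sums are computed sequentially here rather than by the balanced tree algorithm.

```python
from itertools import accumulate

def contract_permutation(perm, m):
    """Contract a permutation of 1..2^k into a permutation of 1..m by
    keeping only the values <= m, in their original relative order."""
    keep = [1 if v <= m else 0 for v in perm]
    position = list(accumulate(keep))  # inclusive prefix sums
    out = [0] * m
    for v, k, pos in zip(perm, keep, position):
        if k:
            out[pos - 1] = v
    return out

if __name__ == "__main__":
    print(contract_permutation([3, 8, 1, 6, 5, 2, 7, 4], 5))
```

If the stored permutation is uniformly random, the relative order of the values 1,...,m within it is uniformly random as well, so the contracted permutation is again uniformly random.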

Complexity. 

Theorem. The algorithm runs in time O(n/p) (= O(log n log* n)), with
probability 1 - O(2^{-Ω(n/(log n log* n)^2)}).

Proof. Each iteration of the while loop takes a total of O(m/p + log m)
time. Step 4 takes O(n/p) time. The time for Step 5 is bounded by the time for
the while loop.

Denote the lengths of the input arrays to iteration i of the while loop by
m_i, i = 1,2,... . The quality of our algorithm is determined by the actual
values of the sequence m_1, m_2, ...

Lemma 1. There exist a constant c_1 > 1 and a positive integer n_0 such that
for all n ≥ n_0 the following is satisfied: the probability that
m_i ≤ c_1 p log^{(i-1)}(n/p), for all i = 2,3,..., is
≥ 1 - O(e^{-Ω(n/(log n log* n)^2)}).

(Where log^{(i)} is defined as follows: log^{(0)} is the empty string and
log^{(i)} = log log^{(i-1)}.)

In this case: (1) the number of iterations of the while loop is ≤ log* n (this
follows readily from our assumption p = n/(log n log* n)); and (2) the time spent
on the while loop is

\[ O\Big( \sum_{i=1}^{O(\log^{*} n)} \big( (p \log^{(i-1)}(n/p))/p + \log n \big) \Big) . \]

The log n term of the summation contributes O(log n log* n) to the total. For
i=1 the first element of the summation is n/p = log n log* n. For i ≥ 2 this term
is O(log n), and therefore the total is O(log n log* n), and the Theorem follows.
The rest of this section is devoted to proving Lemma 1.

An iteration i of the while loop starts with an array of m (= m_i) elements
{1,...,m} which forms a linked list. Till Corollary 1 below we consider only the
first pulse of iteration i. In the first pulse a subset of size p,
I_1 = {a(1), a(m/p + 1), ...}, is selected. The selected subset can be partitioned
into intervals as follows: two elements of the subset belong to the same interval
if all the elements between them in the linked list belong to the subset. Let
R = R(m,p) be the random variable representing the number of intervals. Observe
that having r intervals implies that the number of short-cuts performed in Step 2
is r or r-1 (remember the tail of the list case that happens once during an
iteration). Thus, as a result of the first pulse the following elements are still
in the linked list: the m-p elements that we did not "touch" plus ≤ p-r+1 elements
(to be called "survivors") where a shortcut could have been but was not performed.
Lemma 2 gives the distribution of R(m,p).

Lemma 2. Prob(R(m,p)=r), the probability that R(m,p)=r, 1 ≤ r ≤ p, is

\[ \binom{p-1}{p-r} \binom{m-p+1}{r} \Big/ \binom{m}{p} . \]

Proof. The linked list is being accessed through a random permutation. (All
permutations are equally likely.)

Claim. The number of permutations for which the p elements being accessed
form r intervals in the linked list is
$\binom{p-1}{r-1}\, p!\, \binom{m-p+1}{r}\, (m-p)!$.

The claim together with the fact that the total number of permutations is m!
implies Lemma 2.

Proof of claim. (a) The number of possibilities to partition the p elements
into r non-empty intervals is $\binom{p-1}{r-1}\, p!$.

Let the p elements have p possible locations in a row. Put a "divider" between
each successive pair of locations. Select r-1 of the p-1 dividers and sort the p
elements into their possible locations.

(b) The number of possibilities to partition the m-p remaining elements into r+1
intervals such that the r-1 middle intervals are non-empty is
$\binom{m-p+1}{r}\,(m-p)!$.
We look at m-p+1 pebbles in a row. Select r out of them. The length of the r+1
intervals is now identified as follows: the first (resp. the last) interval has
the same number of elements as the number of pebbles to the left (resp. right) of
the leftmost (resp. rightmost) selected pebble. The number of elements of the
i-th interval, $2 \le i \le r$, is one plus the number of pebbles between the
(i-1)-st and the i-th selected pebbles. It can be readily seen that this defines
a 1-1 correspondence onto the set of such intervals. Finally, sort the m-p
elements into their possible locations.

This completes the proofs of the claim and Lemma 2.
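
For small m and p the distribution of Lemma 2 can be checked by brute force. The
sketch below relies on one observation: since all permutations are equally
likely, the list positions occupied by the p accessed elements form a uniformly
random p-subset of the m positions, so it is enough to enumerate subsets. The
function names are illustrative, and the closed form coded in `lemma2_formula`
is $\binom{p-1}{p-r}\binom{m-p+1}{r}/\binom{m}{p}$.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def count_intervals(selected):
    """Number of maximal runs of consecutive positions in a sorted tuple."""
    return sum(1 for k, pos in enumerate(selected)
               if k == 0 or pos != selected[k - 1] + 1)


def interval_distribution(m, p):
    """Exact distribution of R(m, p) by enumerating all p-subsets of m positions."""
    counts = {}
    for subset in combinations(range(m), p):
        r = count_intervals(subset)
        counts[r] = counts.get(r, 0) + 1
    total = comb(m, p)
    return {r: Fraction(c, total) for r, c in counts.items()}


def lemma2_formula(m, p, r):
    """C(p-1, p-r) * C(m-p+1, r) / C(m, p)."""
    return Fraction(comb(p - 1, p - r) * comb(m - p + 1, r), comb(m, p))


m, p = 10, 4
dist = interval_distribution(m, p)
for r in sorted(dist):
    print(r, dist[r], lemma2_formula(m, p, r))
```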

Lemma 2 shows that the distribution of R(m,p) is actually hypergeometric.

Define another random variable X = p - R(m,p). The distribution of X is
hypergeometric with parameters i=p, q=(p-1)/m, N=m, Nq=p-1 (see Sec. 2) and
therefore E(X) = p(p-1)/m (see [F-50]).

Let Y be a random variable whose distribution is binomial with parameters
i=p, q=(p-1)/m.

Lemma 3. $\mathrm{Prob}(Y \ge C) \ge \mathrm{Prob}(X \ge C)$ for $C \ge p(p-1)/m$.

Proof. In order to apply Proposition 2(a) (Sec. 2) we have to show that
$q = (p-1)/m$ is less than or equal to $C (i-1)^{-1} N (N+1)^{-1}$, which is
$\ge (p(p-1)/m)(1/(p-1))(m/(m+1)) = (p/m)(m/(m+1))$. Since m is much greater than
p, $p-1 < pm/(m+1)$ and the required inequality follows. □

Corollary 1. $\mathrm{Prob}(X \ge (3/2)p(p-1)/m) \le e^{-p(p-1)/(12m)}$.

Proof. By Proposition 1(b) (Sec. 1),
$\mathrm{Prob}(Y \ge (3/2)p(p-1)/m) \le e^{-p(p-1)/(12m)}$. The Corollary then
follows from Lemma 3.

Remark. We arbitrarily selected $\beta = 1/2$ for the application of
Proposition 1(b). In the analysis it is possible to trade a lower probability of
success (smaller $\beta$) for a faster algorithm.
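
To see how much slack the bound of Corollary 1 leaves, one can compare it with
the exact hypergeometric tail of X = p - R(m,p). The sketch below is only an
illustration: the function names and the sample values of m and p are
assumptions, and the bound coded in `corollary1_bound` is
$e^{-p(p-1)/(12m)}$.

```python
from fractions import Fraction
from math import ceil, comb, exp


def hypergeom_tail(m, p, threshold):
    """Exact Prob(X >= threshold) for X = p - R(m, p).

    X counts how many of p draws (without replacement, from m items)
    hit one of p - 1 distinguished items.
    """
    total = comb(m, p)
    k_min = max(0, ceil(threshold))
    tail = sum(comb(p - 1, k) * comb(m - p + 1, p - k) for k in range(k_min, p))
    return Fraction(tail, total)


def corollary1_bound(m, p):
    """exp(-p(p-1)/(12m)), the bound obtained with beta = 1/2."""
    return exp(-p * (p - 1) / (12 * m))


m, p = 1000, 100
threshold = 1.5 * p * (p - 1) / m
print(float(hypergeom_tail(m, p, threshold)), corollary1_bound(m, p))
```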

So we have an upper bound on the probability that the number of survivors of
the first pulse of an iteration of the while loop exceeds
$(3/2)p(p-1)/m + 1$ (recall that the number of survivors is $\le p-r+1$).

Next we show how to apply this analysis of the first pulse to the pulses that
follow. Let us start with the second pulse, where another subset of size p of the
linked list, $I_2 = \{\sigma(2), \sigma(m/p+2), \ldots\}$, is selected. In order
to simplify the analysis our definition of intervals does not take advantage of
the fact that survivors of the first pulse may increase the number of actual
intervals: two elements of $I_2$ belong to the same interval if all the elements
between them in the current linked list either belong to $I_2$ or are survivors
of the first pulse. The main observation is that the number of intervals of the
second pulse meets the distribution of R(m-p,p). The reason for this is that
after the first pulse we virtually delete p elements from the domain of the
permutation $\sigma$ ($\{1, m/p+1, \ldots\}$) as well as from the range of
$\sigma$ (the image of $\{1, m/p+1, \ldots\}$, that is, the set $I_1$). Thus, all
possible permutations on the remaining m-p elements of the domain are equally
likely once the mapping of the p elements $\{1, m/p+1, \ldots\}$ is fixed.

The analysis of the first pulse carries through and results in the following:
the probability that the number of survivors of the second pulse is
$> (3/2)p(p-1)/(m-p) + 1$ is $\le e^{-p(p-1)/(12(m-p))}$.

The definition of intervals at the subsequent pulses will likewise not take
advantage of survivors of their preceding pulses. The distribution of the number
of intervals at pulse i, $1 \le i \le m/p$, meets the distribution of
R(m-(i-1)p, p). And the probability that the number of survivors of the (i+1)-st
pulse is $> (3/2)p(p-1)/(m-ip) + 1$ is $\le e^{-p(p-1)/(12(m-ip))}$. (Note that
our analysis fails completely in measuring the gain in the (m/p)-th pulse.)

Summing up the number of survivors over the m/p pulses gives

$$\sum_{i=0}^{(m/p)-1} \bigl((3/2)p(p-1)/(m-ip) + 1\bigr) \le (3/2)(p-1) \sum_{i=0}^{(m/p)-1} \bigl(1/((m/p)-i)\bigr) + m/p$$

$$\le (3/2)(p-1)\lceil \log(m/p) \rceil + m/p \qquad (4.1)$$

The probability that the number of survivors is at most this bound is

$$\ge (1 - e^{-p(p-1)/(12m)}) \cdots (1 - e^{-p(p-1)/(12(m-ip))}) \cdots$$

We may slightly weaken this bound by observing that this probability is

$$\ge 1 - \sum_{i=0}^{(m/p)-1} e^{-p(p-1)/(12(m-ip))} \ge 1 - (m/p)\, e^{-p(p-1)/(12m)} \qquad (4.2)$$

Recall that $c_1 n/\log n < m \le n$.
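
The weakening that leads to (4.2) replaces a product of per-pulse success
probabilities by one minus the sum of the failure bounds, and then replaces
every failure bound by the largest possible one. The following sketch evaluates
the three quantities for sample values. The index range i = 0, ..., m/p - 1,
the assumption that p divides m, and the function names are choices made for
this illustration.

```python
from math import exp


def pulse_failure_bounds(m, p):
    """Per-pulse bounds exp(-p(p-1)/(12(m - i*p))) for i = 0, ..., m/p - 1."""
    pulses = m // p  # assumes p divides m
    return [exp(-p * (p - 1) / (12 * (m - i * p))) for i in range(pulses)]


def success_probability_bounds(m, p):
    """Return (product bound, 1 - sum bound, closed-form bound as in (4.2))."""
    fails = pulse_failure_bounds(m, p)
    product = 1.0
    for f in fails:
        product *= 1.0 - f
    union = 1.0 - sum(fails)
    closed = 1.0 - (m // p) * exp(-p * (p - 1) / (12 * m))
    return product, union, closed


print(success_probability_bounds(10_000, 1_000))
```

Each of the three values is a lower bound on the same probability, and each is
no larger than the one before it.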

Corollary 2. We consider an iteration of the loop that starts with m
elements. For any constant $c_2 > 3/2$ there exists $n_0$ such that for all
$n > n_0$ the probability that this iteration results in $\le c_2 p \log(m/p)$
survivors is

$$\ge 1 - O\bigl(e^{-\Omega(n/(\log n \log^* n)^2)}\bigr).$$


The selection of $c_1$ for Lemma 1. We select $c_1$ such that the following
assertion is satisfied. Let $s_i = c_1 p \log^{(i-1)}(n/p)$. Obviously,
$m_1 \le s_1$.

Assertion. If $m_i$, the input length for the i-th iteration of the loop, is
$\le s_i$ and $m_i > c_1 n/\log n$ (the repeat condition of the while loop) then
$c_2 p \log(m_i/p) \le s_{i+1}$.

We want that $c_2 p \log(m_i/p) \le s_{i+1}$. Since $m_i \le s_i$ it suffices that

$$c_2 p \log\bigl(c_1 p (\log^{(i-1)}(n/p))/p\bigr) \le c_1 p \log^{(i)}(n/p),$$

which is the same as

$$c_2 (\log c_1 + \log^{(i)}(n/p)) \le c_1 \log^{(i)}(n/p) \quad\text{and}\quad c_2 \log c_1 \le (c_1 - c_2) \log^{(i)}(n/p) \qquad (4.3).$$

The repeat condition of the while loop shows that for $c_1 > c_2$ and
sufficiently large $n_0$, (4.3) is satisfied for all $n > n_0$. Observe that if
$c_1$ is selected to be slightly greater than $c_2$ then a small increase in
$c_1$ may cause a sharp decrease in $n_0$. □
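
The sensitivity noted in the last observation can be made concrete. Condition
(4.3), $c_2 \log c_1 \le (c_1 - c_2)\log^{(i)}(n/p)$, holds as soon as
$\log^{(i)}(n/p)$ reaches $c_2 \log c_1/(c_1 - c_2)$. The sketch below tabulates
this threshold. Base-2 logarithms, the value chosen for $c_2$, and the function
name are assumptions made for illustration.

```python
import math


def required_iterated_log(c1, c2):
    """Smallest value L of log^(i)(n/p) satisfying c2*log(c1) <= (c1 - c2)*L."""
    if c1 <= c2:
        raise ValueError("(4.3) needs c1 > c2")
    return c2 * math.log2(c1) / (c1 - c2)


c2 = 2.0  # any constant > 3/2, chosen here only for illustration
for c1 in (2.1, 2.5, 4.0, 8.0):
    print(c1, required_iterated_log(c1, c2))
```

When $c_1$ is only slightly above $c_2$ the threshold is large, and it drops
quickly as $c_1$ grows.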

We are ready to finish the proof of Lemma 1. It was already argued that if
$m_i \le s_i$ for all i = 2,3,..., then the number of iterations of the loop is
$\le \log^* n$. Therefore, Corollary 2 (and the considerations that led to it)
implies that the probability that $m_i \le s_i$ for all i = 2,3,..., is

$$\ge 1 - O\bigl((\log^* n)\, e^{-\Omega(n/(\log n \log^* n)^2)}\bigr) = 1 - O\bigl(e^{-\Omega(n/(\log n \log^* n)^2)}\bigr).$$

This completes the proof of Lemma 1.

Remark. Recall that the selection of $c_2$ (which is a parameter of the
complexity analysis only) was arbitrary ($> 3/2$). Besides its dependence on
$c_2$, the selection of $c_1$ is arbitrary as well. The selection of $c_1$
affects both the time distribution of the while loop and the time of Step 4. Our
analysis was aimed to show that there exists a choice of $c_1$ for which the
running time is almost surely O(n/p) using p processors provided that n is
sufficiently large. We did not intend to exhaust the various trade-offs possible
as a result of alternative choices for $c_1$ (and $c_2$). We leave this to the
interested reader.

V.  The  Third  Algorithm 

In this section the number of processors p is $n \ln\ln n/\ln n$ on a CRCW PRAM.
Initialization is as in the first algorithm. In addition, SURVIVE(j) is
initialized to 1 for all $1 \le j \le n$.

for Processor i, 1 ≤ i ≤ p, pardo

Step 1. Select a random number N_i between 1 and n (Prob(N_i = j) = 1/n for
1 ≤ j ≤ n), and "mark" the element in location N_i of the array. If several
processors select the same number only one marks the element and the others quit
till Step 3.

Step 2. COUNTER(i) := 0
while COUNTER(i) < ⌈ln n⌉ and D(N_i) is not marked do
    OP(i,t) := (D(N_i), N_i, R(N_i)); SURVIVE(D(N_i)) := 0;
    R(N_i) := R(N_i) + R(D(N_i)); D(N_i) := D(D(N_i));
    COUNTER(i) := COUNTER(i) + 1
od

(Explanation. Starting with the selected element a processor shortcuts over its
⌈ln n⌉ subsequent elements in the linked list or till it hits an element that was
selected by another processor. This portion of the linked list is called the
chain of this processor. The result is that each processor replaces its chain by
a pointer from the tail (the selected element) to the head of the chain.
The three comments following Step 2 of the first algorithm apply here as well.
Note that in the second assignment R(D(N_i)) is always 1.)

Let T_1 (resp. T_2) be the first (resp. last) time unit for which an assignment
into OP( , ) was performed.

Step  3.  A  contraction  of  the  remaining  linked  list  into  an  array  is  performed  as 
in  Step  3  of  the  first  algorithm. 

Step  4.    Apply  a  simulation  of  the  deterministic  algorithm  by  p  processors  to  the 
current  array. 
Step 5.

for Processor i, 1 ≤ i ≤ p, pardo
    for t := T_2 downto T_1 do
        R(OP(i,t).1) := R(OP(i,t).2) - OP(i,t).3
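
The five steps can be traced with a small sequential simulation. The Python
sketch below is not the parallel algorithm itself: processors are simulated one
after another (each processor only modifies its own selected element, so the
order does not matter), Steps 3 and 4 are replaced by a plain sequential walk
over the contracted list, and the explicit end-of-list test in the while loop
is this sketch's stand-in for the tail-of-the-list case. The function name and
the array representation are illustrative assumptions.

```python
import math
import random


def third_algorithm_ranks(succ, p, seed=0):
    """Sequentially simulate Steps 1-5; succ[x] is x's successor or None at the tail.

    Returns (rank, survivors): rank[x] is the number of links from x to the
    end of the list, survivors is the list length after Step 2.
    """
    n = len(succ)
    rng = random.Random(seed)
    D = list(succ)                                # current successor pointers
    R = [0 if s is None else 1 for s in succ]     # weight of the pointer x -> D[x]
    survive = [True] * n
    limit = math.ceil(math.log(n))

    # Step 1: every processor marks a random location (duplicates collapse).
    marked = {rng.randrange(n) for _ in range(p)}

    # Step 2: each marked element shortcuts over up to `limit` unmarked successors.
    ops = []
    for x in marked:
        counter = 0
        # The `is not None` test handles the tail of the list (sketch assumption).
        while counter < limit and D[x] is not None and D[x] not in marked:
            d = D[x]
            ops.append((d, x, R[x]))   # (skipped element, selector, old weight)
            survive[d] = False
            R[x] += R[d]
            D[x] = D[d]
            counter += 1

    # Steps 3-4 (stand-in): rank the contracted list by a sequential walk.
    head = (set(range(n)) - {s for s in succ if s is not None}).pop()
    chain = []
    x = head
    while x is not None:
        chain.append(x)
        x = D[x]
    rank = [None] * n
    for x in reversed(chain):
        rank[x] = R[x] + (0 if D[x] is None else rank[D[x]])

    # Step 5: replay the recorded operations backwards to rank skipped elements.
    for d, x, old_weight in reversed(ops):
        rank[d] = rank[x] - old_weight
    return rank, sum(survive)


# Usage: the list 3 -> 0 -> 2 -> 1, stored as an array of successors.
ranks, survivors = third_algorithm_ranks([2, None, 1, 0], p=2)
print(ranks, survivors)
```

The triple stored in `ops` mirrors OP(i,t): the third component is the distance
from the selected element to the skipped one, which is why Step 5 recovers the
skipped element's value by a single subtraction.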

Implementation  remark.  A  slightly  weaker  concurrent-write  assumption  (than  the 
one  given  in  the  definition  of  the  models  in  the  introduction)  suffices  for  this 
algorithm:  In  case  more  than  one  processor  attempts  to  write  into  the  same  memory 
location  one  of  these  processors  succeeds  but  we  do  not  know  in  advance  which. 
Such  an  assumption  was  used  in  [SV-82]. 

Complexity. 

In Lemma 1 and Corollaries 1 and 2 below the ⌈ln n⌉ tail elements of the
original linked list are excluded.

Lemma 1. Let x be an element of the input array. The probability that none
of the ⌈ln n⌉ elements that precede x in the linked list was selected by one of
the (p =) $n \ln\ln n/\ln n$ processors in Step 1 is $\le 1/\ln n$.

Proof. The probability that these ⌈ln n⌉ elements do not include the element
selected by processor i, $1 \le i \le p$, is $\le (n - \ln n)/n$. Since the
selections of numbers by distinct processors are independent, the probability
that the ⌈ln n⌉ list elements preceding x do not include a selected element is

$$\le ((n - \ln n)/n)^p = \bigl[(1 - \ln n/n)^{n/\ln n}\bigr]^{\ln\ln n}.$$

Since the monotone increasing sequence $\{(1 - 1/n)^n\}$ converges to 1/e we get
$\le (1/e)^{\ln\ln n} = 1/\ln n$.
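
The final inequality of the proof is easy to check numerically. The sketch
below evaluates $((n - \ln n)/n)^p$ with $p = n \ln\ln n/\ln n$ and compares it
with $1/\ln n$. The function name and the sample values of n are chosen only
for illustration, and p is treated as a real number rather than rounded.

```python
import math


def lemma1_probability(n):
    """Return ((n - ln n)/n)^p with p = n ln ln n / ln n, together with 1/ln n."""
    ln_n = math.log(n)
    p = n * math.log(ln_n) / ln_n
    return (1.0 - ln_n / n) ** p, 1.0 / ln_n


for n in (10**3, 10**6):
    print(n, *lemma1_probability(n))
```

The two values approach each other as n grows, so the bound $1/\ln n$ is
essentially tight for this estimate.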

Corollary 1. The expected number of elements such that none of the ⌈ln n⌉
elements preceding them in the linked list was selected by a processor is
$\le n/\ln n$.

Corollary 2. The number of elements such that none of the ⌈ln n⌉ elements
preceding them in the linked list was selected by a processor is
$\le n \ln\ln n/\ln n$ with probability at least $1 - 1/\ln\ln n$.

Proof. Let E(X) be the expectation of a random variable X that gets only
nonnegative values. Then, in general, Prob(X > tE(X)) < 1/t. The number of
elements such that none of the ⌈ln n⌉ elements preceding them in the linked list
was selected by a processor is always nonnegative. Therefore, Corollary 2
follows from Corollary 1 (take t = ln ln n).
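
A small Monte Carlo run shows the quantities of Corollaries 1 and 2 side by
side. In the sketch below positions 0..n-1 are taken in linked-list order; this
loses nothing because the selection is uniform over array locations. Positions
that do not have a full window of ⌈ln n⌉ predecessors are skipped, p is rounded
down to an integer, and the function name and seed are assumptions of the
sketch.

```python
import math
import random


def count_uncovered(n, seed=0):
    """Count positions whose ceil(ln n) predecessors were all left unselected.

    Positions 0..n-1 are in linked-list order, so the predecessors of
    position k are k-1, k-2, ...; the first ceil(ln n) positions are skipped.
    """
    rng = random.Random(seed)
    window = math.ceil(math.log(n))
    p = int(n * math.log(math.log(n)) / math.log(n))
    selected = [False] * n
    for _ in range(p):
        selected[rng.randrange(n)] = True
    uncovered = 0
    for k in range(window, n):
        if not any(selected[k - window:k]):
            uncovered += 1
    return uncovered


n = 10_000
print(count_uncovered(n), n / math.log(n), n * math.log(math.log(n)) / math.log(n))
```

The three printed numbers are the observed count, the bound on its expectation
from Corollary 1, and the threshold of Corollary 2.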

Theorem. The length of the linked list following Step 2 is
$\le p + p + \lceil \ln n \rceil$ ($= 2n \ln\ln n/\ln n + \lceil \ln n \rceil$)
with probability at least $1 - 1/\ln\ln n$.

Proof. The first p represents pointers from tails to heads in chains of
processors. The second p is from Corollary 2. Finally, there are ⌈ln n⌉ tail
elements that were not considered above.

Let us sum up the time spent at each step. Step 1: O(n/p). Steps 2 and 5:
O(ln n). Step 3: O(ln n). Step 4: O(ln n) with probability $1 - 1/\ln\ln n$.
Thus, the third algorithm runs in O(ln n) time with probability
$1 - 1/\ln\ln n$ using $n \ln\ln n/\ln n$ processors.